Spring 2021 - Social Data Analysis and Visualization (02806) - DTU

SF311 - Final Project

1. Motivation

While exploring the Open Data database of San Francisco, we chose the SF311 dataset because it gives valuable insight into the public life of San Francisco residents. The dataset is also highly relevant to several departments within the government of San Francisco. It contains service requests within different categories, filed by the citizens of San Francisco. For each complaint the dataset records, among much else, the geographical location, time of complaint, category and source of complaint. Our main purpose is to communicate key insights from this vast amount of information in a clear and structured manner. Additionally, our goal is for readers to feel like they are on a guided tour of the SF311 dataset, leaving the tour with valuable insight into how the complaint categories develop over time, how they are distributed geographically across the city, and whether some neighbourhoods are similar in their concentration of different complaint types.

2. Basic Statistics

2.1 Data Preparation

2.1.1 Data and Basic Statistics

We start by importing the libraries and the main dataset used in this project. It consists of around 4.8 million observations and 47 variables, with a total size of 2.1 GB. Below we print a list of all the variables the dataset provides, as well as a preview of its first rows.

The dataset basic statistics are briefly outlined in the following overview:

We can also look at how the top 8 complaint categories are represented.

The other columns of interest are as follows.

Importing GeoData

We decided that the Analysis Neighbourhood level is the best compromise between capturing details and communicating a bigger picture, so we import GeoJSON data containing the geometry of 41 neighbourhoods in SF. The SF311 dataset has the same neighbourhoods in a column, but there they are numbered 1-41, so the first task was to couple the GeoJSON neighbourhood names with the corresponding neighbourhoods in the SF311 dataset. This was done by plotting the longitudes and latitudes from both datasets and then manually writing down which ones corresponded to each other.

2.1.2 Cleaning and Filtering

We narrow down and preprocess the data before the analysis can begin.

Making relevant datetime columns

First we create datetime columns so we can easily access the recorded times. We will not analyse request processing time, so we only use the ‘Opened Time’, referring to when the request was registered by 311.
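As a minimal sketch of this step, the snippet below splits one timestamp string into the parts used for grouping later on. It uses only the standard library. The timestamp layout in `OPENED_FORMAT` and the function name `time_parts` are assumptions for illustration and should be adjusted to the actual export.

```python
from datetime import datetime

# Assumed timestamp layout; adjust to match the actual dataset export.
OPENED_FORMAT = "%m/%d/%Y %I:%M:%S %p"

def time_parts(opened: str) -> dict:
    """Split an opened-time string into the parts used for grouping."""
    ts = datetime.strptime(opened, OPENED_FORMAT)
    return {
        "year": ts.year,
        "month": ts.month,
        "weekday": ts.weekday(),  # Monday = 0, Sunday = 6
        "hour": ts.hour,          # 0-23
    }

parts = time_parts("03/14/2019 08:15:00 AM")
```

Keeping the weekday as an integer avoids locale-dependent day names and makes it easy to sort the days in calendar order.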

Deleting incomplete years

We exclude the years 2008 and 2021 since these are incomplete years.

Filtering and merging the categories

As explained, we can choose to focus on the 20 most requested categories, which we have calculated to represent 92.6% of all requests, or on the top 10, which represent 77.7% of all requests.
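The share covered by the top categories can be computed along the following lines. This is a sketch on toy data, so the numbers it produces are not the 92.6% and 77.7% reported above. The function name `top_n_share` is made up for illustration.

```python
from collections import Counter

def top_n_share(categories, n):
    """Fraction of all requests that fall in the n most common categories."""
    counts = Counter(categories)
    total = sum(counts.values())
    return sum(count for _, count in counts.most_common(n)) / total

# Toy data, not the real SF311 counts.
toy_categories = (
    ["Graffiti"] * 5 + ["Encampments"] * 3 + ["Tree Maintenance"] * 2
)
share_top2 = top_n_share(toy_categories, 2)  # 8 of 10 requests
```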

We are also interested in the broader themes of complaints, so we went through the categories and merged selected ones. For example, the four different categories that relate to MUNI feedback were merged into one.
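One simple way to express such a merge is a lookup table from original label to merged label. The sub-category names below are hypothetical placeholders, since the four actual MUNI labels are not listed here.

```python
# Hypothetical labels: the real MUNI sub-category names are not given in the text.
MERGE_MAP = {
    "MUNI Feedback A": "MUNI Feedback",
    "MUNI Feedback B": "MUNI Feedback",
    "MUNI Feedback C": "MUNI Feedback",
    "MUNI Feedback D": "MUNI Feedback",
}

def merge_category(category: str) -> str:
    """Return the merged label, or the original label if it is not merged."""
    return MERGE_MAP.get(category, category)

merged = [merge_category(c) for c in ["MUNI Feedback B", "Graffiti"]]
```

Categories that are not in the table pass through unchanged, so the mapping only needs to list the labels that are actually merged.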

2.2 Preliminary Explorations and Conclusions

The histogram above illustrates one of the first findings of our exploratory data analysis. By exploring the geographical distribution of several service request categories, using histograms of request count over longitude and latitude, we initially found that the requests for each category are distributed differently across the city of San Francisco. This initial finding led us to investigate how the geographical distribution of complaints changes over time, i.e. over years, months and hours of the day.

Secondly, by plotting the distribution of complaints in geographical scatterplots, we found that some areas have higher concentrations of certain request categories. We also implemented a choropleth map with police districts, where we found that some districts have higher concentrations of certain request types. These findings led us to implement a choropleth map with neighbourhoods, allowing us to investigate whether the same holds at neighbourhood level. Additionally, they sparked an investigation into whether neighbourhoods can be grouped into meaningful clusters of similar neighbourhoods.

To control the size of this notebook, these Folium plots were not included. However, the final implementation of the choropleth plot and clustering algorithm can be found in part 3.3.2.

3. Data Analysis

3.1 Temporal Patterns

The first visualization focuses on the total count of complaints in each category. The whole dataset contains 103 different categories, although the majority of them are very poorly represented. In fact, the top 20 categories (around 20% of all the categories found in the dataset) represent 92.6% of all the complaints. The plot below shows these top 20 categories and their total counts, so they can easily be compared with each other.

3.1.1 Development over time

Narrowing it down even further, the top 10 categories account for 77.7% of the total complaints in the whole dataset. The plot below has been designed with the user experience in mind and offers as much interaction as possible. Two main modes are introduced, offering more in-depth insight into how each complaint category evolves over time. In addition, a time window widget has been added right below the plot, making it easier to navigate through the data regardless of whether you want to look into years or months.

The aim of the following sets of plots is to analyze and visualize the complaint development over time in San Francisco. By looking at the dataset from these particular perspectives we intend to identify and detect complaint trends taking into consideration months, days of the week, or hours of the day.

Complaint count by week day

The bar charts below show the number of complaints in the city grouped by each day of the week.

Several interesting trends can be observed in the plots above. In general, the 10 categories analyzed show a decrease at the weekend, with Sunday being the day with the fewest complaints registered. This pattern, however, cannot be observed for Parking Enforcement, which has a rather constant level of complaints throughout the week. It is interesting to see how Damaged Property drops significantly on Sunday, probably because Sunday is the day of the week when most people are resting at home.

Complaint count by month

We now turn our attention to the number of complaints in the city grouped by months.

It can be observed that Encampments are slightly more frequent between July and October, which makes sense because these are summer holiday months with nice weather. For Street and Sidewalk Cleaning there is a pronounced peak in January, most likely due to New Year's celebrations and major events. In the Sewer Issues category, complaints grow in the winter months, peaking in December and January.

Complaint count by 24h cycle

Below we break down the complaint information into the 24 hours of the day.

At first glance it is clear that the number of complaints fluctuates throughout the 24-hour cycle for all categories. One pattern is consistent across all the categories shown: between 12am and 5am the number of complaints recorded is minimal. Most of the categories peak in the morning between 8am and 10am. Some of them also show a double peak, in the morning and in the afternoon, which matches working hours and depends on the nature of the complaint itself. For instance, it makes sense for Parking Enforcement to show this pattern, as it peaks at the hours when most people are moving between home and work.

Complaint count by hour of the week

Finally, we aggregate the complaints over the 168 hours of the week and plot them together to obtain some interesting plots.
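The hour-of-week index behind such a plot can be sketched as follows: each timestamp is mapped to a number from 0 to 167 and the complaints are counted per index. The function names are illustrative, and Monday is taken as the start of the week by assumption.

```python
from collections import Counter
from datetime import datetime

def hour_of_week(ts: datetime) -> int:
    """Map a timestamp to 0..167, where 0 is the hour starting Monday 00:00."""
    return ts.weekday() * 24 + ts.hour

def hourly_week_counts(timestamps):
    """Count complaints per hour of the week; returns a list of 168 counts."""
    counts = Counter(hour_of_week(ts) for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(168)]

# Toy timestamps, not real complaints.
toy_times = [
    datetime(2019, 3, 14, 8, 15),   # Thursday 08:15
    datetime(2019, 3, 21, 8, 40),   # the following Thursday, same hour
    datetime(2019, 3, 17, 23, 5),   # Sunday 23:05
]
week_counts = hourly_week_counts(toy_times)
```

Complaints from different weeks land in the same slot when they share weekday and hour, which is what makes recurring weekly spikes visible.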

When plotting the count of complaints on an hourly basis over the entire week, several patterns emerge. For instance, Encampments show a pronounced spike every morning during the week; however, this pattern is not consistent at the weekend. The cause of this trend is unclear. It is interesting to see that Saturday and Sunday do not usually show the morning or afternoon spikes that appear on most workdays. This is probably because people do not have as tight a schedule at weekends as on workdays, so the complaints spread more evenly over the day.

Additionally, we have plotted five interesting categories as violin and box plots below, following the 24 hours of the day. This gives a different perspective and makes it easier to compare the categories visually. The user can select just the categories they want to compare, and the shapes and gridlines make the plot easy to read and to draw new conclusions from. For instance, one can quickly see that Street and Sidewalk Cleaning complaints are more common in the morning and then decrease as the day goes by, while Encampments show a more pronounced peak in the morning and a second one late in the evening.

Complaints by source

The focus is now on the different channels users use to communicate their complaints to SF311. The plot below shows the 7 different sources categorized in the dataset. At first glance we see that two sources have just a few complaints recorded. These are Email and Other Department, which have only 26 and 350 appearances in the whole dataset and are therefore negligible. Phone and Mobile Open311 App stand out as the two main channels users prefer when they want to communicate a complaint to the SF311 service.

As with the plot in the previous section, the plot below is designed to give the user as much interactivity as possible when exploring the total count of complaints by source over time, on a yearly and monthly basis. The user is invited to explore and play with this tool, finding patterns and insights.

3.1.2 Before and After Covid

The following section is a brief analysis of the impact of Covid on various categories. This is done by plotting the behaviour of several complaint types over the week and throughout the hours of the day. Each violin plot is divided into a blue half and a yellow half, the first referring to pre-Covid behaviour and the latter to post-Covid behaviour. In other words, the post-Covid half covers the timeframe between 2020 and 2021, while the pre-Covid half covers the time before 2020.
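The split behind the two halves of each violin can be sketched as below. The cut-off at the start of 2020 follows the description above. The names `COVID_START_YEAR`, `covid_period` and `hours_by_period` are hypothetical.

```python
from datetime import datetime

# Assumption taken from the text: requests from 2020 onwards count as post-Covid.
COVID_START_YEAR = 2020

def covid_period(ts: datetime) -> str:
    """Label a timestamp as 'pre' or 'post' Covid."""
    return "post" if ts.year >= COVID_START_YEAR else "pre"

def hours_by_period(timestamps):
    """Collect the hour of day of each complaint, split by period."""
    groups = {"pre": [], "post": []}
    for ts in timestamps:
        groups[covid_period(ts)].append(ts.hour)
    return groups

# Toy timestamps, not real complaints.
toy_opened = [
    datetime(2018, 5, 2, 9),
    datetime(2019, 11, 20, 17),
    datetime(2020, 6, 1, 13),
]
split_hours = hours_by_period(toy_opened)
```

Each of the two lists can then be passed to a plotting library as one half of a split violin.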

Graffiti

In the plot above we can see how complaints in the Graffiti category differ pre vs post Covid, and the results are quite interesting. This complaint category has clearly been affected by Covid-19, or at least there is a notable difference between the two periods. As the violin plot shows, complaints are more evenly distributed over the day.

Tree maintenance

We have also plotted the evolution of the Tree Maintenance complaint. By looking at the plot we are not able to spot or conclude there is any significant difference in the behaviour of the citizens while reporting this type of complaint from a before and after covid perspective.

Encampments

The following plot focuses on the Encampments category. From Monday to Friday there is a fairly consistent pre-Covid pattern, with a pronounced peak in the morning and a second one later in the evening. This makes sense, since people would complain more often at those hours about people who are camping either that night or the previous one. The post-Covid part is more evenly distributed, but we cannot conclude that there is any big difference just by looking at this plot.

3.2 Spatiotemporal Patterns

We start by subsetting the relevant columns to decrease the dataset size and make the code run more efficiently.

For all but the section investigating yearly patterns, we chose to focus on the years 2016-2020 to make the analysis more relevant to the present day situation. Further, we exclude the years 2008 and 2021 since these are incomplete years.

Visualising geographical distribution of service complaints.

In the following we will be visualizing the geographical distribution of service complaints within different categories. We will be looking at the development of this spatial distribution over years, months and hours, thereby investigating both the spatial and temporal development of the service complaints.

For this purpose we constructed a scatterplot that takes the longitude and latitude of each service complaint and plots these points on a map of San Francisco. For plotting the observations we defined two different approaches:

  1. Sample a fixed number of complaints from each category.
  2. Sample a fraction of complaints from each category.

The first approach only allows us to investigate how the spatial patterns in the categories develop over time, whereas the second approach also allows us to investigate how the total number of complaints changes over the year, month and hour. Generally a scatterplot is not well suited for visualizing count data, but when used in conjunction with the bar plot in part 3.1.1, the scatterplot allows us to investigate whether these changes in the total number of complaints are due to local or global changes in the geographical distribution of complaints.
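The two sampling approaches can be sketched as follows, on toy coordinates. The function names, the seed handling and the data layout (a dictionary from category to a list of longitude/latitude pairs) are assumptions for illustration.

```python
import random

def sample_fixed(points_by_category, n, seed=0):
    """Approach 1: draw the same number of points from every category."""
    rng = random.Random(seed)
    return {
        category: rng.sample(points, min(n, len(points)))
        for category, points in points_by_category.items()
    }

def sample_fraction(points_by_category, fraction, seed=0):
    """Approach 2: draw the same share of points from every category."""
    rng = random.Random(seed)
    return {
        category: rng.sample(points, int(len(points) * fraction))
        for category, points in points_by_category.items()
    }

# Toy (longitude, latitude) pairs, not real complaint locations.
toy_points = {
    "Graffiti": [(-122.40 - i * 0.001, 37.77) for i in range(100)],
    "Encampments": [(-122.41, 37.76 + i * 0.001) for i in range(40)],
}
fixed = sample_fixed(toy_points, 10)
scaled = sample_fraction(toy_points, 0.25)
```

With a fixed size every category shows the same number of dots, so only the shape of the distribution is comparable. With a fraction the number of dots stays proportional to the category's real volume.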

Focus categories:

In this part we chose to focus on the categories "Graffiti", "Encampments" and "Tree Maintenance" out of the 101 different categories, to focus our analysis on a few relevant categories. We focused on these specific categories since they cover a diverse set of complaint types in the city of San Francisco.

Notice:

Due to constraints in upload capacity we were not able to upload the full scatterplots to the website. We therefore plot around 1500 fewer samples per category on the website than in the notebook, so the highlighted patterns might not be as evident on the website. Please feel free to refer to the notebook for the full size visualizations.

Investigating yearly patterns...

Let's start by investigating the yearly pattern for the three categories...

Overall, 'Encampments' and 'Graffiti' occur mostly in the city center. Interestingly, however, we see that both 'Graffiti' and 'Encampments' occur more often in the outer regions of the city in the early years 2009-2011 than in the later years 2018-2019. For instance, in the southern region of SF there is a cluster of 'Graffiti' complaints in the early years, which has almost completely dissolved by 2015.

Based on this development we can say that 'Graffiti' and 'Encampments' are most likely to occur in the city center of SF. Hence, future efforts to reduce the number of complaints for these categories should be geographically focused at the city center.

'Tree Maintenance' complaints are more evenly distributed across the city and remain so over the years. Hence, there is no clear change in the spatial pattern of this complaint type over the years.

Let's zoom in and investigate developments in the geographical distribution on a monthly basis.

Investigating monthly patterns...

When investigating the bar plot in part 3.1.1 we noticed a kind of "beginning-of-year-effect" in the distribution of 'Graffiti' complaints. Every January-February the number of complaints increases suddenly and keeps increasing throughout March-April, after which it drops again until it reaches its low in December. This effect is particularly evident in the years 2016-2019, so we will focus on these years in the following.

Let's see if this trend is due to changes in local or global trends in the geographical distribution of complaints...

Graffiti

In the plot we see that the "beginning-of-year-effect" results in a global increase in the number of complaints. We see only small local increases in complaint count, for instance in the southern area of the city, where we identified a cluster in the previous section.

All in all, the spatial pattern stays very similar, and the overall increase in 'Graffiti' complaints is not driven by any local increase. Graffiti complaints remain most prevalent in the city center throughout the months of the year. Hence, to address the increase in the number of 'Graffiti' complaints, the city should implement initiatives focused on the city center.

Let's now zoom in once again and investigate the development in the geographical distribution over the hours of the day...

Investigating daily patterns...

In the plot we see the daily cycle of the number of complaints for the categories of 'Graffiti' and 'Encampments'. We chose to focus on these categories after investigating the distribution of 'Tree Maintenance', where we found no interesting pattern. Additionally, this helps limit the size of the plot for efficient integration into the website.

Overall, we see that during the first hours of the day (the night) there are few complaints in all categories. As we approach 7:00 AM and people start going to work, we see a steep increase in the number of complaints in all categories. From 7:00 AM until noon the number of complaints keeps increasing. As we approach 4:00-5:00 PM we see a large decrease in the number of complaints across all categories. By 11:00 PM the number of complaints is at a low point and stays there throughout the night.

This pattern, arising from the natural rhythm of our society, is of course to be expected.

When plotting the distribution of complaints over the hours for each category in the previous section, we found that Graffiti complaints are more tightly and evenly distributed between 8:00 AM and 4:00 PM. Encampment complaints, on the other hand, have a larger peak at 8:00 AM and are slightly more spread out, from 8:00 AM to 7:00 PM.

Let's see if this pattern coincides with specific geographical patterns for each of the categories separately...

From the plot we see that the initial spike in the number of complaints at 8:00 AM arises in the city center. At the same time we see that the drop in the number of complaints occurs later, around 7:00 PM. The spatial distribution is stable throughout the day, and the increase in the number of complaints arises from a global increase all over the city. In other words, the increase is not driven by increases in certain geographical locations. Hence, to combat Encampment occurrences, our recommendation is to focus efforts on the city center, since this is where they occur throughout the entire day.

Let's look at the Graffiti complaints...

For the Graffiti complaints we see that the distribution between 8:00 AM and 4:00 PM is focused in the city center. Additionally, we see the drop in complaints at around 4:00-5:00 PM. Interestingly, it seems that throughout the evening hours the number of complaints drops most in the city center but stays at a higher level in the outer regions of the city, e.g. Golden Gate Park. This might be driven by the fact that people go to the parks in the outer regions of the city during the evening. Unlike Encampment complaints, which are assumed to concern something occurring at the time of the complaint, Graffiti complaints can concern graffiti made at any time prior to the complaint. Hence, we can't say with certainty that efforts to combat Graffiti should be focused on the outer regions of the city during the evening hours. We can say, however, that the higher number of complaints might indicate slightly more Graffiti occurrences than average in the outer regions during the evening hours, bearing in mind that this might be driven by more people going to these areas during these hours and filing the complaints.

3.3 Cluster Analysis

We now turn our interest towards the different neighbourhoods in San Francisco. The following will investigate how the neighbourhoods differ, when it comes to the distribution of 311 complaints, but also which neighbourhoods have the same profile when it comes to 311 complaints.

To get a current representation of 311 this analysis focuses on the years 2015-2020. Based on the 22 most frequent 311 requests, the San Francisco neighbourhoods are clustered into 10 clusters each consisting of neighbourhoods with similar concerns and requests.

Learning Clusters

We start by creating a Frequency Table:

We pick the 22 most frequent request categories to focus on and create a dataframe where each row vector is a frequency distribution over request types in a specific neighbourhood. The rows are normalized so that each entry corresponds to the percentage a given request type accounts for in a given neighbourhood.
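A row-normalized frequency table of this kind can be sketched in plain Python as follows. The neighbourhood labels "A" and "B" are placeholders, and the entries are stored as fractions that sum to 1 per row rather than as percentages.

```python
from collections import Counter, defaultdict

def frequency_table(records, categories):
    """Build one row per neighbourhood with the share of each focus category.

    records: iterable of (neighbourhood, category) pairs.
    categories: ordered list of the focus categories (the table columns).
    """
    counts = defaultdict(Counter)
    for neighbourhood, category in records:
        if category in categories:
            counts[neighbourhood][category] += 1
    table = {}
    for neighbourhood, row_counts in counts.items():
        total = sum(row_counts.values())
        table[neighbourhood] = [row_counts[c] / total for c in categories]
    return table

# Toy records with placeholder neighbourhood labels.
focus = ["Graffiti", "Encampments", "Tree Maintenance"]
toy_records = (
    [("A", "Graffiti")] * 3 + [("A", "Encampments")] * 1
    + [("B", "Tree Maintenance")] * 2 + [("B", "Graffiti")] * 2
)
freq = frequency_table(toy_records, focus)
```

Normalizing per row means a large and a small neighbourhood with the same mix of complaints end up with the same vector, so the clustering compares profiles rather than volumes.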

Self-Organising-Maps

To explore the dataset we implemented a Self-Organising Map (SOM), a kind of artificial neural network used for unsupervised clustering. SOMs are very useful for exploring high-dimensional data because they map it onto a 2D grid, or Kohonen layer, grouping observations much like K-means clustering does. By mapping the pivot table of neighbourhoods and complaint frequencies to a 2D map, we investigate whether the neighbourhoods can be clustered into meaningful clusters.

We defined a 3x3 grid (9 clusters) because this strikes a nice balance between interpretability and flexibility in the representation.
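To make the idea concrete, here is a minimal SOM written in plain Python. It is a sketch under stated assumptions, not the notebook's implementation. The learning rate, neighbourhood radius, number of epochs and both function names are illustrative choices.

```python
import math
import random

def nearest_cell(weights, row):
    """Return the grid cell whose weight vector is closest to the row."""
    return min(
        weights,
        key=lambda cell: sum((w - x) ** 2 for w, x in zip(weights[cell], row)),
    )

def train_som(rows, grid=3, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a grid x grid self-organising map on equal-length rows."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(grid) for j in range(grid)]
    # Start every cell at a randomly chosen data row.
    weights = {cell: list(rng.choice(rows)) for cell in cells}
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs        # shrinks from 1 towards 0
        lr = lr0 * decay                    # learning rate
        sigma = max(sigma0 * decay, 0.1)    # neighbourhood radius on the grid
        for row in rng.sample(rows, len(rows)):  # visit rows in random order
            bi, bj = nearest_cell(weights, row)
            for (i, j), w in weights.items():
                grid_dist2 = (i - bi) ** 2 + (j - bj) ** 2
                pull = lr * math.exp(-grid_dist2 / (2 * sigma ** 2))
                for k, x in enumerate(row):
                    w[k] += pull * (x - w[k])
    return weights

# Toy frequency rows (each sums to 1), not real SF311 data.
toy_rows = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
som = train_som(toy_rows)
assignments = [nearest_cell(som, row) for row in toy_rows]
```

Each row is assigned to the cell with the closest weight vector, and cells that sit next to each other on the grid are pulled towards similar rows during training. This is what lets the 2D grid reflect similarity in the original high-dimensional space.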

Agglomerative Hierarchical Clustering and PCA

With the SOM above we explored the data and learned that a natural grouping of neighbourhoods does exist, so we proceed to investigate how many clusters are meaningful for our analysis. We use hierarchical clustering, since a dendrogram gives a quick overview of where to cut when balancing the number of clusters against the within-cluster distance (determined by the distance metric) and the between-cluster distance (determined by the linkage method). In addition, we use Principal Component Analysis to project the data onto the first and second principal components to get a visual idea of which neighbourhoods are grouped together. After experimenting with the different parameters, the final choice fell on 10 clusters using the Euclidean distance and Ward's linkage method.
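To illustrate what Ward's linkage does, the sketch below implements a naive agglomerative clustering in plain Python. At every step it merges the two clusters whose merge gives the smallest increase in within-cluster sum of squares. The function name `ward_clusters` and the toy points are made up for illustration. In practice a library routine would normally be used instead of this quadratic loop.

```python
def ward_clusters(points, k):
    """Greedy agglomerative clustering with Ward's criterion.

    Returns k clusters, each a list of indices into points.
    """
    dim = len(points[0])
    clusters = [[i] for i in range(len(points))]

    def centroid(members):
        return [sum(points[i][d] for i in members) / len(members) for d in range(dim)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        ca, cb = centroid(a), centroid(b)
        dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * dist2

    while len(clusters) > k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda p: merge_cost(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return clusters

# Toy 2D points, not real neighbourhood profiles.
toy_xy = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0]]
groups = ward_clusters(toy_xy, 3)
```

The same merge cost is what a dendrogram built with Ward's linkage reflects on its vertical axis, which is why a large jump in height suggests a natural place to cut.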

Principal Component Analysis

Results

The data frame constructed below is a table showing how each request category is distributed across clusters. For example, you can look at encampments and see that 37 % of all encampments are reported in Cluster 7. This is the result of the clustering analysis, which we would like to communicate and visualize in the next section.
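A table of this kind can be derived by counting requests per category and cluster and then normalizing within each category. The sketch below uses toy records, so its percentages are not the ones reported above. The function name `share_by_cluster` is hypothetical.

```python
from collections import Counter, defaultdict

def share_by_cluster(records):
    """For each category, the percentage of its requests found in each cluster.

    records: iterable of (category, cluster) pairs.
    """
    counts = defaultdict(Counter)
    for category, cluster in records:
        counts[category][cluster] += 1
    return {
        category: {
            cluster: 100 * n / sum(per_cluster.values())
            for cluster, n in per_cluster.items()
        }
        for category, per_cluster in counts.items()
    }

# Toy records, not the real cluster assignment.
toy_pairs = (
    [("Encampments", 7)] * 3 + [("Encampments", 2)] * 1
    + [("Graffiti", 4)] * 2
)
shares = share_by_cluster(toy_pairs)
```

Note that this normalizes per category (each category sums to 100 % across clusters), which is the opposite direction from the per-neighbourhood normalization used to learn the clusters.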

Visualizing and Exploring Clusters

Geographical Overview:

The code below produces a map of San Francisco where each of the 41 neighbourhoods is outlined and coloured according to which of the 10 clusters it belongs to. Due to the size of the plot, it is not plotted, but if you click here, it will open in a new tab. In the plot you can click on a cluster in the menu on the left to make it appear or disappear on the map.

You can see that in some clusters, like Cluster 7, the neighbourhoods are geographically close, because different areas of a city have different challenges. For example, downtown areas differ from residential areas, since they are used in different ways by different people. Cluster 6, by contrast, is more spread out, but its neighbourhoods are similar in what kind of area they are.

Wordclouds

Below you will see a word cloud for each cluster. The words displayed are the type of 311 request each cluster experiences, and the size of the word tells you how frequent a request is within that cluster.

All categories are represented in almost all clusters, but you can quickly see the prominent issues in each. For example, the most pressing issues in Cluster 4 are Graffiti and Illegal Posting, and people visiting Cluster 10 frequently give MUNI feedback. You can also see that some clusters, like Cluster 1, have a broader palette of complaints.

Barplot - The distribution of 311 requests in each cluster:

Below you will see a bar plot with a horizontal bar for each focus request, where each color represents a cluster, so you can see how much each cluster contributes to an issue. For example, you can hover over encampments and see that 37 % of all encampments are reported in Cluster 7. You can also deselect all clusters except one to see the distribution of complaints within that cluster.

The different types of neighbourhoods in San Francisco:

With the aid of the map, word clouds and bar plot, we will now dive into each cluster and try to explain which 311 requests characterize it.

The Parks and Recreations of San Francisco

The odd ones out

The crowded and busy city center

The outskirts of the center

4. Genre

We knew from the beginning that we wanted our visualizations to have an interactive component, engaging the user in exploring the story we are telling about the patterns and developments in the SF311 dataset. For that reason, each section includes several interactive plots. To explore the temporal and the temporal-spatial development, we knew from the start that we wanted the visualizations to include timeline sliders. This way the user is engaged and gets the feeling of discovering the development over time in the categories of interest.

Overall, we leverage several genres, including “annotated charts” and “partitioned posters”. Each visualization leans more towards one of the genres depending on whether we want to emphasize a reader-driven or an author-driven approach. For instance, in the first part, 3.1.1, where we explore the overall temporal development of the categories and complaint sources, we leverage basic “magazine style” visualizations, consisting of only a single frame with no interactive properties. We use the “magazine style” genre in this section to ensure that the user follows a predefined path and thereby receives the exact story we want to tell. However, in this part we also have highly interactive visualizations, allowing readers to explore the data on their own. This is exemplified by what we consider the main visualization of this part, which shows the development in the total number of complaints for different categories. Here the reader can filter by year and month and zoom in on specific time intervals of interest. Additionally, the reader can select categories of interest with a drop-down menu and decide whether to see only the last 3 years, the last 5 years or all years.

To keep the visual platform coherent, we used the visual narrative tactic of Consistent Visual Platform. With a coherent visual platform, the user is not overwhelmed by too many changing visual inputs, and attention is thereby focused on the most important aspect: the changing content when engaging with the interactive visualizations. This is exemplified by the main plot in part 3.1.1, where the visual platform remains consistent while the content is explored by the user.

5. Visualization

Temporal

To investigate the temporal development in the number of complaints within each category we used interactive bar plots and violin plots. As described in section 4, we emphasized interactive plots to engage the reader. Bar plots were chosen to visualize changes in number of complaints because they excel at efficiently communicating differences in amounts. You can stack a large amount of bars next to each other while maintaining interpretability. Since we are dealing with a very large data set with several categories, years and months, this was a highly desired property for our purpose. In addition, violin plots were implemented to visualize the distribution of complaints during the hours of the day. By plotting several categories next to each other, the violin plot allows for easy interpretation of how the distributions differ.

Temporal-spatial

To visualize the development in the geographical distribution of complaints for different categories we implemented geographical scatterplots, because they excel at visualising distributions of observations. In addition, scatterplots have several parameters such as opacity and markersize which can be adjusted to improve interpretability. The major limitation of scatter plots is that only the distributions of a few categories can be visualized in the same plot. We found that 2 or 3 categories were the maximum while maintaining interpretability.

Neighbourhood clustering

Finally, we implemented choropleth plots to visualize the geographical component and investigate differences in complaint type concentrations between neighbourhoods. To visualize the resulting clusters of our unsupervised clustering analysis, we used word clouds, because they give a quick overview of the clusters. To combine the choropleth map with the findings from our unsupervised clustering, we colored each neighbourhood based on which cluster it belonged to. To visualize the raw distributions of complaint types and how much each cluster accounts for each complaint type, we implemented a bar plot. This bar plot conveys a lot of information and is meant for diving deeper into the raw data. For instance, if one cluster is selected, the distribution of complaints within that cluster is visualized.

6. Discussion

During the project we have worked in depth with the SF311 dataset. The dataset is very dense and contains vast amounts of information. For that reason, we often found it difficult to set a clear direction for our analysis and to define a clearly structured and interesting story. Considering the complexity of the dataset, we think that we really dived deep into the detail.

Due to time constraints we did not have the time needed to dive deep into each topic covered in the three different sections. When investigating the temporal development of complaint types, diving into the impact of Covid19 is an entire project in itself. The same is true for the last part investigating the clustering of neighbourhoods. In addition to solely the concentration of complaint types, several other variables could have been included to add more dimensions to the clustering model. One interesting variable to include would be the budget assigned to each neighbourhood and then investigating how that impacts the clustering of neighbourhoods over time.

Throughout this project, interactive and visually appealing plots have been the centerpiece, and we think we succeeded with this to a large extent. Additionally, we set out to investigate whether neighbourhoods could be clustered into meaningful clusters, and based on our results we are confident that we gained and communicated some valuable insights about this aspect. Finally, a lot of effort was put into the website, striving to present a clear, coherent, engaging and interesting story to the reader. However, given additional resources, all of these areas could of course have been improved even further.

7. Contributions

During this project

However, it is worth noting the great collaboration throughout the course, both in the assignments and in the final project, so that each of us has been aware of and contributed to the project in an equal proportion.

8. References